[00:00.000 --> 00:07.340]  So hi, I'm Justin Pagliarani. I'm a staff security engineer at Flatiron. What that means is I'm the
[00:07.340 --> 00:13.100]  team lead for the cloud security team. Prior to Flatiron, I worked at Bishop Fox as a pen tester
[00:13.760 --> 00:19.020]  and various quasi-governmental organizations before that. So I've been around for a little
[00:19.020 --> 00:26.120]  while. Yeah, I guess that's kind of me. Sahir? Sure. So I'm a senior security engineer at Flatiron
[00:26.120 --> 00:31.100]  Health. I've been here like three years. And then before that, I was a software engineer and did
[00:31.100 --> 00:36.220]  software engineering for like three years. And that's pretty much about me. Yep. So for anybody
[00:36.220 --> 00:42.260]  who doesn't know about Flatiron, we basically use technology to try and fight cancer. We're not
[00:42.260 --> 00:45.680]  going to be talking about any of that today. We're going to be talking about cloud security
[00:47.000 --> 00:52.160]  misconfiguration detection and remediation. So kind of going into that, we can move to the
[00:52.160 --> 00:55.140]  next slide. Sorry, we were playing the slide dance. We got some...
[00:57.920 --> 01:01.900]  Scott Piper from Summit Route has helped us out quite a bit with this, as well as Reza,
[01:01.900 --> 01:07.060]  who is our summer intern. He's a PhD student, and he did some really great work for us as well,
[01:07.060 --> 01:14.000]  helping us build this out and put it to good use in our environment. So kind of on from there,
[01:14.000 --> 01:17.900]  we'll get into the agenda. We're going to kind of give you the background of the problem we were
[01:17.900 --> 01:24.740]  facing, kind of our first iteration of a solution, kind of where we're at now, the things that we've
[01:24.740 --> 01:32.980]  solved for, a quick demo, and a Q&A. So kind of our general background here is we kind of wanted
[01:32.980 --> 01:38.840]  to know how our critical resources are configured. We wanted to be able to detect if things become
[01:38.840 --> 01:45.640]  configured badly, and if they become misconfigured, can we fix it in a reasonable amount of time
[01:45.640 --> 01:51.320]  automatically without having to be on call at night, because nobody wants to do that.
[01:51.400 --> 02:01.320]  So kind of generally why we want to do this is because cloud configuration problems are
[02:02.120 --> 02:08.400]  pretty easy to accidentally do in the new kind of DevOps, well, newish DevOps paradigm,
[02:08.400 --> 02:14.120]  where we've now decided that the software engineers are sysadmins, mistakes happen,
[02:14.120 --> 02:20.060]  and we end up with lots of things like, you know, S3 buckets being exposed, you know,
[02:20.060 --> 02:26.860]  people losing IAM tokens through IMDS, all kinds of weird things, basically through like
[02:26.860 --> 02:31.760]  silly misconfigurations that if you kind of like thought through, you know, you would catch pretty
[02:31.760 --> 02:35.540]  easily, like having an S3 bucket be available publicly when it has sensitive information
[02:35.540 --> 02:41.680]  is like logically a very simple mistake. But because of the complexities of how we are
[02:41.680 --> 02:48.320]  building and utilizing cloud resources, it's something that kind of happens a lot. So at
[02:48.320 --> 02:53.980]  Flatiron, we handle millions and millions of patient records, you know, raw PII from doctors.
[02:54.040 --> 03:01.080]  So, you know, an S3 bucket with that information becoming publicly available would one, be a
[03:01.080 --> 03:06.180]  disaster for us from a business standpoint, but also be a disaster for our patients. And when
[03:06.180 --> 03:10.500]  you're dealing with cancer patients, the last damn thing they need is to be dealing with identity
[03:10.500 --> 03:15.220]  theft, or even the thoughts of it. You know, we don't we don't want to do that. Stress is
[03:15.220 --> 03:25.280]  bad for health outcomes. So we're trying to avoid that. So kind of our first iteration on going
[03:25.280 --> 03:31.980]  about fixing this was to use lambdas, you know, just do serverless bits of code to check over
[03:31.980 --> 03:37.060]  our resources every 10 minutes or so. We had a bunch of lambdas checking for a lot of different
[03:37.060 --> 03:42.740]  things. It generally would do one lambda per resource or one lambda per misconfiguration
[03:42.740 --> 03:47.520]  that would detect an issue and then fix it automatically, just kind of like your standard
[03:47.520 --> 03:52.140]  polling model. And when we first started doing this, we had many fewer resources. We were a much
[03:52.140 --> 03:58.640]  smaller company. It worked great. Things were fantastic. And it was great. So kind of the
[03:58.640 --> 04:04.420]  general architecture of how this worked was we had separate lambdas for each remediation. We had
[04:04.420 --> 04:10.020]  CloudWatch events or we would be watching the CloudWatch events from our primary account,
[04:10.020 --> 04:15.020]  which we call our security account. And then the lambda would assume roles into each of our other
[04:15.020 --> 04:18.780]  accounts to kind of look at resources that way. So we kind of had like a centralized, isolated
[04:20.100 --> 04:24.080]  deputy that would then kind of move on to the different accounts and monitor all the resources
[04:24.080 --> 04:30.300]  and then fix them if things were broken. And it worked. You know, we had great things. So,
[04:30.300 --> 04:35.600]  we would be getting JIRA tickets that would tell us like, oh, somebody stood up this S3
[04:35.600 --> 04:40.500]  bucket and it was public accidentally, but we fixed it right away. So no harm, no foul.
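The kind of per-resource check those V1 Lambdas ran can be sketched in Python. This is a minimal illustration, not the speakers' actual code: `acl_is_public` is a hypothetical helper, and it assumes the ACL is passed in with the dictionary shape that boto3's `get_bucket_acl` returns. The two URIs are the S3 predefined groups for everyone and for any authenticated AWS user.

```python
# Hypothetical helper: decide whether an S3 bucket ACL exposes the bucket.
# The input is assumed to look like the response of boto3's s3.get_bucket_acl().
PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def acl_is_public(acl: dict) -> bool:
    """Return True if any grant targets one of the public S3 groups."""
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in PUBLIC_GROUP_URIS:
            return True
    return False
```

In a polling setup, a function like this would be called for every bucket in every account on each run, which is where the API call volume comes from.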
[04:44.560 --> 04:50.780]  A fidelity of about 10 minutes. So that was not so bad. You know, it worked. It was great. Why
[04:50.780 --> 04:56.500]  did we move on? Well, the general problem was we were using tons of API calls, lots and lots. And
[04:56.500 --> 05:01.300]  as the environment grew, it got to be worse. That got to be very costly. It was causing production
[05:01.300 --> 05:08.860]  issues and everyone was kind of mad at us. It was not a great thing. So, you know, we ended up with
[05:08.860 --> 05:16.560]  lots of rate limiting, which was, I mean, not the worst thing in the world, but it wasn't great. And
[05:16.560 --> 05:21.000]  it became more and more of a problem as time went on. So, you know, we kind of wanted to avoid this,
[05:21.000 --> 05:27.460]  you know, every 10 minutes things were starting to get screwed up. So kind of our general
[05:27.460 --> 05:33.540]  post-mortem was we had these cron, like basically cron jobs running to check these things. It worked
[05:33.540 --> 05:40.080]  very well. But, you know, polling is just kind of crappy in general. So we wanted to move away
[05:40.080 --> 05:45.160]  from that and move into something better because just the number of API calls in a cloud environment
[05:45.160 --> 05:49.520]  was getting costly from a number of different standpoints. And then also just the general
[05:49.520 --> 05:57.440]  fidelity of having a 10 minute delay in response time was not necessarily great. You know, if we
[05:57.440 --> 06:03.780]  could act a little bit more, I guess a quicker response would be better, you know, when you're
[06:03.780 --> 06:12.860]  dealing with as much PII as we do. So kind of enter the remediation framework V2. So Sahir, take it
[06:12.860 --> 06:20.060]  away. All right. So all these problems which we saw before, like multiple API calls and also having
[06:20.060 --> 06:25.280]  exponential backoffs, which were like impacting other stuff, right? And the issue also was like
[06:25.280 --> 06:31.340]  we had like cron based polling, which will depend on like how is your cron frequency. Like we had
[06:31.340 --> 06:36.340]  like 10 minutes, but for some reasons, for some stuff we had like 20 minutes also. It was not real
[06:36.340 --> 06:42.880]  time or it was not that easy to scale. So that all brought us to like work on the V2.0 of that,
[06:42.880 --> 06:48.560]  like how we can like aggregate all of these independent lambdas into one unified code base.
[06:48.560 --> 06:53.060]  So instead of like multiple lambdas, we have one single stuff where we have like multiple modules
[06:53.060 --> 07:00.260]  for like let's say S3, lambda, RDS, or AMIs, or EC2s. So we try to like unify everything and try
[07:00.260 --> 07:06.020]  to add like event driven models. So because we know like what events would be responsible for
[07:06.020 --> 07:10.640]  making things, like for example, like when we talk about S3 buckets, you would know that put bucket
[07:10.640 --> 07:16.340]  ACL or put bucket policy would be one of those calls, like which would make the bucket public,
[07:16.340 --> 07:21.600]  right? By making either slapping an ACL or slapping a bucket policy, which basically exposes
[07:21.600 --> 07:27.020]  your S3 bucket. Similarly, we have the same API calls for other resources, like let's say
[07:27.020 --> 07:33.100]  EC2 AMIs or snapshots. So we wanted to capture on that and make a framework which can tap onto
[07:33.100 --> 07:38.940]  these events which are happening and then maybe trigger our framework. So if we move ahead,
[07:38.940 --> 07:45.720]  we'll see the architecture, like how exactly it is set up. So the V2.0 is basically,
[07:45.720 --> 07:53.000]  you can use like multiple accounts with it, multiple regions. So let's say if you have like
[07:53.000 --> 07:57.740]  10 accounts, you can basically configure your members in 10 accounts and also the multiple
[07:57.740 --> 08:04.940]  regions. So how we do here is we have our CloudTrail event rules, which basically capture
[08:04.940 --> 08:11.420]  on the events which happen in, let's say, account A. And those events could be like, hey, like EC2
[08:11.420 --> 08:19.640]  modify snapshot attribute. And what happens here is that sends the events on event bus
[08:19.640 --> 08:24.780]  to your master account where your remediator function lives. So event bus basically will
[08:24.780 --> 08:30.160]  send that to the master account. And then on the master account, you have same pattern rules which
[08:30.160 --> 08:36.200]  would capture on the EC2 modify snapshot attribute. And then that is captured by the
[08:36.200 --> 08:42.820]  event translator, which is like Lambda. So it will capture the raw event and then massage it
[08:42.820 --> 08:48.680]  and send that to SQS queue. And then the SQS queue is piped to the remediator Lambda, basically.
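A rough sketch of what the event translator step might do: take the raw CloudTrail event delivered through the event bus and reduce it to a small message for the SQS queue. The output field names, the `RESOURCE_ID_KEYS` table, and the `translate_event` function are assumptions for illustration; the framework's real message format is not shown in the talk.

```python
import json

# Hypothetical mapping from CloudTrail event names to the request parameter
# that identifies the affected resource.
RESOURCE_ID_KEYS = {
    "ModifySnapshotAttribute": "snapshotId",
    "PutBucketAcl": "bucketName",
    "PutBucketPolicy": "bucketName",
}


def translate_event(raw_event: dict) -> str:
    """Reduce a raw event-bus event to a compact JSON message for the queue."""
    detail = raw_event.get("detail", {})
    event_name = detail.get("eventName", "")
    params = detail.get("requestParameters") or {}
    id_key = RESOURCE_ID_KEYS.get(event_name)
    message = {
        "account": raw_event.get("account"),
        "region": raw_event.get("region"),
        "event_name": event_name,
        # None when the event is not one the sketch knows how to handle.
        "resource_id": params.get(id_key) if id_key else None,
    }
    return json.dumps(message, sort_keys=True)
```

The resulting string is what would be passed as the message body when sending to the queue that feeds the remediator.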
[08:48.800 --> 08:53.440]  And the Lambda, remediator Lambda will basically look at the resource, try to enumerate it,
[08:53.440 --> 08:58.020]  what does it look like? Does the snapshot look like it has been shared publicly? Does the snapshot
[08:58.020 --> 09:02.580]  look like it has been shared with a random AWS account, which probably we don't own?
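The decision described here can be sketched as a small pure function. It assumes the permissions arrive as the `CreateVolumePermissions` list returned by boto3's `describe_snapshot_attribute`; the function name, the return labels, and the `trusted_accounts` parameter are hypothetical.

```python
def snapshot_exposure(permissions: list, trusted_accounts: set) -> str:
    """Classify snapshot sharing as 'public', 'unknown_account', or 'ok'."""
    shared_with_unknown = False
    for perm in permissions:
        # {"Group": "all"} means the snapshot is shared with everyone.
        if perm.get("Group") == "all":
            return "public"
        user_id = perm.get("UserId")
        if user_id and user_id not in trusted_accounts:
            shared_with_unknown = True
    return "unknown_account" if shared_with_unknown else "ok"
```

A remediator could then remove the offending permission for anything that does not come back as `ok`.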
[09:02.580 --> 09:07.720]  Based on that decision, the remediator Lambda will make the decision to remediate it or not
[09:07.720 --> 09:14.480]  remediate it. So there are event translator and event forwarder here and remediator function,
[09:14.480 --> 09:23.140]  which makes it event-driven. Moving on, like we mentioned the limitations of V1,
[09:23.140 --> 09:29.140]  which was based on the polling model and we were using Cron for triggering our previous Lambdas
[09:29.140 --> 09:34.960]  every 10 minutes or 20 minutes. So we thought we would maybe get rid of the polling-based stuff,
[09:34.960 --> 09:39.620]  but not really. So we still have Cron, we still have the polling-based approach
[09:40.160 --> 09:47.500]  in our V2 where we poll for certain issues, let's say, for example, missing MFA or old access keys
[09:47.500 --> 09:53.580]  of inactive users. But some of these events are really hard to capture on your CloudTrail API logs
[09:53.580 --> 09:59.500]  or the event rules, like let's say old access keys, how do you capture on the API calls?
[09:59.500 --> 10:05.180]  You have to enumerate the user, find when it was created, when it was used and stuff.
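Because no single API call signals that a key has gone stale, this check has to compare dates. A minimal sketch, assuming the creation and last-used timestamps have already been fetched (for example via IAM's `list_access_keys` and `get_access_key_last_used`); the function name and the 90-day threshold are assumptions, not values from the talk.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def key_is_stale(
    created: datetime,
    last_used: Optional[datetime],
    now: datetime,
    max_idle_days: int = 90,  # assumed threshold
) -> bool:
    """Return True if the key has been idle longer than max_idle_days."""
    # A key that was never used is measured from its creation date.
    reference = last_used or created
    return now - reference > timedelta(days=max_idle_days)
```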
[10:05.180 --> 10:10.620]  So we have the polling stuff, which basically runs on a Cron, say nightly or whatnot. And then
[10:10.620 --> 10:17.320]  what it does is it will go to each account, assume a role and do the enumeration and then send
[10:17.320 --> 10:22.820]  everything, every finding to an SQS queue, which is sent to a remediator function, same remediator
[10:22.820 --> 10:27.300]  function, which will again enumerate the resource and make the decision if it should be remediated
[10:27.300 --> 10:37.360]  or not. All right, so now who needs API calls? So what we see here is we saw a big drop in number
[10:37.360 --> 10:42.660]  of API calls which our remediator function was making. The one spike you see is the poller function,
[10:42.660 --> 10:48.280]  which was running nightly for describing the resources. And also if we move ahead and see
[10:48.280 --> 10:53.560]  how many request limit exceeded errors we got, it was almost like zero. Maybe it won't translate
[10:53.560 --> 11:00.760]  exactly the same for you, but what we saw was we experienced zero pushbacks from AWS.
[11:05.010 --> 11:09.790]  Okay, so what are we watching exactly? So these are the list of things which we are watching
[11:09.790 --> 11:16.050]  currently. So if you look at the EC2, we have a check for publicly accessible AMI or
[11:16.050 --> 11:22.690]  an AMI shared with an unknown account, or EBS snapshots publicly exposed, or same applies if
[11:22.690 --> 11:27.790]  it is shared with an unknown account. Missing tags, IMDS v2. And then one extra thing which we have
[11:27.790 --> 11:33.770]  here is a check for publicly exposed EC2 instances. And what we have done is we categorize
[11:34.370 --> 11:40.070]  that off like a dev account. So you have like an environment variable where you can specify what
[11:40.070 --> 11:44.970]  are your dev accounts. And let's say if you want more stringent controls on that account, you can
[11:44.970 --> 11:50.670]  make that rule that hey, no EC2 should be exposed in dev account. And similarly, we have like
[11:50.670 --> 11:56.730]  multiple other remediation for IAM users, like look for inactive console access, look for inactive
[11:56.730 --> 12:04.370]  access keys, missing MFA for RDS if RDS is publicly exposed, if the snapshot is exposed,
[12:04.370 --> 12:10.770]  or if the storage is not encrypted. Similarly, it applies to Redshift, S3, and IAM role if it's
[12:10.770 --> 12:17.210]  or Lambda. I think, yeah, this list is like pretty much like self-explanatory.
[12:17.990 --> 12:24.830]  All right, so moving on to the demo, which we have, let me... Yeah, and to jump in real quick,
[12:24.830 --> 12:31.390]  adding remediations to this is also, you know, pretty easy. It's not like this is just hard
[12:31.390 --> 12:37.370]  code. It isn't like a monolith. It's very modular. So writing this is really as easy as writing a
[12:37.370 --> 12:42.170]  pretty simple Lambda that checks whatever it is that you're wanting to check. So a lot of these
[12:42.170 --> 12:48.050]  are sort of kind of designed around the requirements of our environment, but we've
[12:48.050 --> 12:52.470]  built this in such a way that you can add things to your environment. So if for whatever reason you
[12:52.470 --> 12:57.810]  want all of your assets to be publicly available in your DMZ account or something, like you can
[12:57.810 --> 13:02.690]  enforce that kind of thing too. It's really not too complicated to write one of these modules,
[13:02.690 --> 13:07.490]  and it's basically plug and play. Yeah, exactly. Yeah, so like just to mention, so our main
[13:07.490 --> 13:13.290]  intention was like make it modular so you can add more controls. I can actually show that, like if I
[13:13.950 --> 13:30.600]  go to GitHub, AWS. All right, so if you go to resources, just a small look here on the remediator
[13:32.460 --> 13:36.120]  and auditors, so you basically have like multiple checks for each module. So what
[13:36.120 --> 13:39.600]  you can really maybe add for things that you want to watch for, let's say you want to watch
[13:39.600 --> 13:45.260]  for API gateways or you want to watch for ECS and whatnot. So you can basically add a module here.
[13:46.720 --> 13:51.320]  All right, so jumping onto the demo, let's make some things like publicly exposed. Let's make an
[13:51.320 --> 13:59.860]  AMI exposed. So we have a test AMI here, and what we are going to do is modify image permissions
[13:59.860 --> 14:05.840]  and make it publicly exposed. Save it. Let's do the same thing for snapshots meanwhile.
[14:08.670 --> 14:18.150]  So I have this, modify permissions, make it public, save it, go back to the AMI,
[14:18.150 --> 14:24.710]  refresh. Most likely it might be done by now, maybe not, we'll see. So it was made public,
[14:24.710 --> 14:31.010]  now it has been converted to private. It's almost real time, like probably took like
[14:31.010 --> 14:37.230]  three, four seconds, I think. And go back to snapshots. If we look in the permissions,
[14:37.230 --> 14:43.910]  it's private. So we can do the same exercise for S3 also. Let's go to S3
[14:45.650 --> 14:54.070]  and just choose any of the buckets. Let's slap public access, list object,
[14:54.070 --> 14:59.610]  and I can't see because something is blocking my view. All right, so ACL public,
[15:00.510 --> 15:09.030]  and we can do similar stuff for lambdas, wait for IAM role also. IAM, let's go to role,
[15:09.030 --> 15:15.750]  I might have a test role here. What we'll do is edit trust relationship and just allow everyone
[15:15.750 --> 15:21.990]  with the principal star to assume this role. Update the trust relationship. Yeah, sure.
[15:22.630 --> 15:27.250]  It says overly permissive policies. Let's go back and we'll see if S3 has been fixed.
[15:29.070 --> 15:36.050]  Okay, so the ACL was removed and it's back again private. Let's see IAM, what has happened to this
[15:36.050 --> 15:44.540]  thing. Okay, so it basically changed the trust relationship and added deny on principal star.
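Detecting the wildcard trust relationship from the demo can be sketched like this. The function name is hypothetical, the input is assumed to be the role's trust policy as a parsed JSON document, and the sketch deliberately ignores `Condition` blocks, which a real check would likely need to consider.

```python
def _as_list(value):
    """Policy fields may hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trust_policy_allows_anyone(policy: dict) -> bool:
    """Return True if any Allow statement names the wildcard principal."""
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        if principal == "*":
            return True
        if isinstance(principal, dict):
            # e.g. {"AWS": "*"} or {"AWS": ["*", "arn:aws:iam::..."]}
            for value in principal.values():
                if "*" in _as_list(value):
                    return True
    return False
```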
[15:44.540 --> 15:50.700]  So basically, no one can assume this role. So yeah, so basically, what you can do is that you
[15:50.700 --> 15:56.580]  can have like multiple services and it basically will respond to it almost in real time. Let's say
[15:56.580 --> 16:02.400]  you're making RDS or modifying an RDS instance and making it publicly exposed. It will capture
[16:02.400 --> 16:09.000]  the event and try to see if the star configuration right now is bad. As in, is it publicly exposed?
[16:09.000 --> 16:13.980]  Is it not encrypted? And make the decision. Some things which are challenging is RDS and Redshift
[16:13.980 --> 16:19.400]  because when you spin up an RDS, what happens is it takes a shitload of time, like 20 minutes,
[16:19.400 --> 16:25.140]  30 minutes. So when you're using an event-driven approach, the lambdas will trigger immediately,
[16:25.140 --> 16:30.100]  but when they try to enumerate the RDS instance, it might not be in the available state.
[16:30.560 --> 16:35.660]  And then what happens is like your lambdas will see that it's not available and they'll skip
[16:35.660 --> 16:40.800]  it. But once your RDS is available and it is public facing, then you can't
[16:40.800 --> 16:46.420]  catch it unless your poller runs like by a cron. So what we have done is like we have added some
[16:46.420 --> 16:51.760]  like repeat invocation, like let's check for the RDS. If it is not available, maybe repeat
[16:51.760 --> 16:56.620]  the same message to SQS, send it again like after like five minutes so it will keep on checking
[16:56.620 --> 17:00.780]  every five minutes. Is it available now? Is it available now? If it is available, check it and
[17:00.780 --> 17:04.960]  then make the decision like if it is public, make it private. If it is not encrypted, stop the
[17:04.960 --> 17:17.780]  instance. All right, so let's jump on to the Q&A. Do we have any questions?
[17:19.320 --> 17:26.840]  Oh yeah, a couple of questions for me. So in this case, those little modules that you showed
[17:26.840 --> 17:33.320]  is when it comes to detecting something as well as responding to it. The response part,
[17:33.320 --> 17:39.800]  is that also built into each of the modules? Is that how? So yeah, so basically what happens
[17:39.800 --> 17:50.160]  I'll show you. If I go to the code, so if I go to resources, event translator,
[17:51.080 --> 17:56.420]  main.py. So what happens like for each of these API calls, which are saying like create queue,
[17:56.420 --> 18:02.560]  set SQS, queue attributes. So you have like for DB, we have like some delays like 120
[18:03.200 --> 18:08.180]  seconds. Similarly for these, like they are almost real-time, right? Whenever this event is seen,
[18:08.180 --> 18:14.420]  forward the event almost real-time to the SQS queue, and then it will trigger the
[18:14.420 --> 18:19.020]  remediator function. But some of the other calls we have, like we have added some delays.
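The per-event delays and the repeat invocation for slow resources could be expressed roughly as follows. Only the 120-second database delay and the five-minute recheck come from the talk; the event name `CreateDBInstance` as the key, the zero default, and the `delay_for` function are assumptions. The 900-second cap reflects the maximum per-message delay SQS accepts.

```python
# Hypothetical delay table, in seconds, keyed by CloudTrail event name.
EVENT_DELAYS = {"CreateDBInstance": 120}
DEFAULT_DELAY = 0      # assumed: forward other events almost immediately
RETRY_DELAY = 300      # recheck every five minutes while not available
SQS_MAX_DELAY = 900    # SQS allows at most 15 minutes of per-message delay


def delay_for(event_name: str, resource_status: str = None) -> int:
    """Pick the delay for the first send, or for a re-send when not ready."""
    if resource_status is not None and resource_status != "available":
        # Resource still being created: put the message back and try later.
        delay = RETRY_DELAY
    else:
        delay = EVENT_DELAYS.get(event_name, DEFAULT_DELAY)
    return min(delay, SQS_MAX_DELAY)
```

The returned value would be used as the delay when the message is sent, or re-sent, to the queue.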
[18:21.860 --> 18:27.980]  Yeah, yes, yes. And then kind of the beauty of some of this is that because it's all relatively
[18:27.980 --> 18:34.760]  short Python code, if you wanted to integrate this with like, I don't know, if you have like
[18:34.760 --> 18:40.380]  some kind of compliance tool that has detection or remediation built in and has like an API,
[18:40.380 --> 18:46.080]  you can pretty easily leverage those like REST APIs or whatever to do the detection and the
[18:46.080 --> 18:50.700]  remediation for you. And this just kind of ties all the automation together because most of those
[18:50.700 --> 18:54.920]  tools that we've experimented with don't really have that. So that's something we're kind of
[18:54.920 --> 19:00.720]  working on building out as well as leveraging some of the other like detection and response
[19:00.720 --> 19:05.280]  tools so that this kind of just glues it all together and makes it event-driven. So you have
[19:05.280 --> 19:11.140]  lots of flexibility to do stuff like that. And the remediator modules are relatively short. It's
[19:11.140 --> 19:15.460]  basically just examining the event and then doing, you know, making API calls to do what it is you
[19:15.460 --> 19:21.020]  want to do. So yeah, the remediations are built in and very easily customizable. Yeah, that basically
[19:21.020 --> 19:27.640]  almost answered my second question, which was, you know, the poller function, which detects something,
[19:27.640 --> 19:33.860]  that the CloudWatch events doesn't cover, that could easily be substituted or, you know, something
[19:33.860 --> 19:39.260]  could be added on to it, which is an open source tool that does those checks or any additional
[19:39.260 --> 19:43.960]  stuff that you want to do. Yeah, definitely. I mean, if you look at the poller code right now,
[19:43.960 --> 19:49.980]  it's doing similar stuff, but doing like a sanity check every night, let's say,
[19:49.980 --> 19:54.260]  it's doing the same RDS check, describe RDS, describe Snapshot, describe AMI, and also
[19:54.260 --> 19:59.000]  including the other checks, like IAM user, if they are not using the access keys and stuff. So
[19:59.000 --> 20:03.260]  we have clubbed everything, but let's say if you don't want to use something, some of the stuff,
[20:03.260 --> 20:07.680]  you can basically remove it. So it will, like, do sanity check, like, once a night, but you can
[20:07.680 --> 20:13.780]  modify the cron schedule. Let's say you want to do your, like, once an hour or once every four
[20:13.780 --> 20:18.540]  hours, you can do that. It's very customizable. And also the whole deployment is all, like,
[20:18.540 --> 20:22.680]  Terraform and bash scripts. So it's very easy to deploy and easy to destroy.
[20:23.920 --> 20:32.880]  Got it. We have a couple of questions here from the chat, from John. In the what are we watching
[20:32.880 --> 20:40.720]  slide under EC2, why are you watching IMDS v2? To ensure v1 is not used?
[20:41.220 --> 20:46.940]  Yeah, that's exactly it. So, you know, we have a large environment with lots of apps,
[20:46.940 --> 20:51.460]  and because it's medical apps, there's lots of, like, legacy components,
[20:51.460 --> 20:58.140]  and there's lots of, you know, complexity. So, yeah, the IMDS v2 stuff is mostly monitoring
[20:58.140 --> 21:04.600]  what is still using IMDS v1. And then Sahir, correct me if I'm wrong, it also flags when
[21:04.600 --> 21:08.460]  things are using IMDS v2 to, like, say it's compliant, but it doesn't, like,
[21:08.460 --> 21:12.820]  it's not automatically killing things. I don't believe it's written.
[21:12.820 --> 21:19.240]  If you're in a state where you can, like, you are, like, okay, I'm in that state where my EC2
[21:19.240 --> 21:24.840]  are supposed to be on IMDS v2, you can toggle the plug and stop and add the check for stop
[21:24.840 --> 21:34.020]  instances if they are not, like, on v1 or on v2. And then second question from Scott.
[21:34.360 --> 21:38.240]  Is the AWS remediation framework intended to complement or supplant
[21:38.240 --> 21:42.220]  the use of AWS config with its own remediation processes?
[21:42.780 --> 21:49.380]  So what we have, like, thought of, like, was to add config also as one of the events.
[21:49.380 --> 21:53.920]  Also, like, right now we have polling and we have event-driven. So we thought of, like,
[21:53.920 --> 21:57.920]  moving to config and, like, enumerate it, but we haven't, like, done the full enumeration part,
[21:57.920 --> 22:01.560]  like how config can, like, help us in achieving the similar state which we are achieving by
[22:01.560 --> 22:05.640]  event-driven plus polling. I'm sure, like, config has the custom rules and whatnot,
[22:05.640 --> 22:09.260]  which you can, like, add and do the remediation based on the custom lambdas.
[22:09.260 --> 22:13.640]  But, yeah, something which for to-do we want to, like, do, maybe add, like, a config integration,
[22:13.640 --> 22:17.300]  maybe if you want to have three options of, like, leveraging the same framework. Because
[22:17.920 --> 22:22.240]  even with custom model tool, you'll have to have, like, your lambdas to make the remediation.
[22:24.680 --> 22:25.400]  And so...
[22:25.400 --> 22:26.260]  Oh, go ahead.
[22:26.600 --> 22:27.540]  Oh, no, go ahead.
[22:27.620 --> 22:29.900]  I just wanted to ask, kind of going off of that,
[22:29.900 --> 22:35.400]  any other sneak peek or future stuff you have in the plan for this tool?
[22:35.400 --> 22:40.400]  Yeah. So, I mean, Sahir and I have both kind of mentioned it. One of the things that we're
[22:40.400 --> 22:45.320]  wanting to look at is more integration into other tools, as well as things like AWS config,
[22:45.320 --> 22:51.480]  to kind of leverage this stuff. We are a relatively small team. I think we have, well,
[22:51.480 --> 22:59.200]  we have three people, basically, kind of managing our pretty large cloud environment. So, you know,
[22:59.200 --> 23:07.000]  when it comes to leveraging existing tooling, our general philosophy is kind of use it all
[23:08.020 --> 23:14.220]  and automate the triage to clean stuff up. So, really, what we're trying to get into is,
[23:14.220 --> 23:19.320]  can we leverage existing checks and remediations from things like config,
[23:19.320 --> 23:23.560]  Security Monkey is dead, but, like, tools that are similar to Security Monkey and things like that,
[23:23.560 --> 23:27.300]  and leverage those checks and remediations instead of writing our own,
[23:27.300 --> 23:32.800]  using kind of like the REST APIs and all the fun, nice features that people are putting out there.
[23:32.980 --> 23:38.780]  So, you know, if we see a tool that makes this kind of stuff easy, the future state that we see
[23:38.780 --> 23:44.540]  for the remediation framework is kind of gobbling it up and consuming it to make this stuff more
[23:44.540 --> 23:50.740]  effective, giving us greater confidence that the detective capabilities are actually correct to,
[23:50.740 --> 23:55.120]  like, lower false positive rate, or just to make development quicker.
[23:58.000 --> 24:06.660]  Got it. And another question for me, you mentioned that your team over there is quite small.
[24:06.660 --> 24:13.380]  I know that in order to get to a place where you guys are at with this tool to remediate right away,
[24:13.380 --> 24:18.180]  or even with a certain delay, get those remediations going in the cloud environment,
[24:18.180 --> 24:24.800]  you need to have sort of a in-level kind of in-depth knowledge of your cloud environment,
[24:24.800 --> 24:30.660]  as well as buy-in from other teams. Any kind of insights as to how you got there,
[24:30.660 --> 24:37.400]  fun stories, or anything like that? I think me and Sahir can both give some insight into this. So
[24:37.400 --> 24:43.660]  Sahir has been at Flatiron much longer than me, and he is a very beloved resource among the entire
[24:43.660 --> 24:49.900]  company, where I am kind of known more as the jackass who comes in and tells everybody no all
[24:49.900 --> 24:56.440]  the time. But I like doing that. So one of the things that we found that really made these things
[24:56.440 --> 25:00.800]  very successful was communicating to our engineers, because, you know, they're engineers,
[25:00.800 --> 25:05.420]  they like to see cool shit. And like showing them that this is cool, it means that they can
[25:05.420 --> 25:18.600]  experiment more as they're developing and building without worry. Oh, we lost Justin.
[25:21.300 --> 25:27.380]  Justin, if you could try again for the past five seconds.
[25:27.440 --> 25:32.820]  Yeah, so that developers could experiment a little more without fear of exposing, you know,
[25:32.820 --> 25:38.500]  all of our patients' data. They actually really enjoyed that idea. And then we also found that
[25:38.500 --> 25:43.140]  when we showed them that we had the capabilities to do these things, they suddenly had all kinds
[25:43.140 --> 25:47.860]  of ideas as to how they could make use of it as part of their deployment, development, and testing
[25:47.860 --> 25:53.820]  processes. So being able to tear down test instances after, you know, 15 minutes after
[25:53.820 --> 25:58.640]  their tests completed, and things like that. So a lot of the way that we got buy-in from this stuff
[25:58.640 --> 26:03.540]  was showing them how not only did it make the company and our patients safer, it made their day
[26:03.540 --> 26:10.080]  easier and gave them the ability with, you know, a lot without having to build a whole new solution,
[26:10.080 --> 26:13.940]  they could like automate their testing processes and have it all be event driven,
[26:13.940 --> 26:19.220]  which was stuff that they wanted. So, you know, just being willing to be transparent and communicate
[26:19.220 --> 26:23.800]  very closely with them and be helpful and show them how this could be like a boon to their
[26:23.800 --> 26:29.600]  processes was really, really integral in getting people excited and on board with this, where we
[26:29.600 --> 26:34.100]  could just start flipping switches and turning remediations on, because suddenly, you know,
[26:34.100 --> 26:38.300]  they didn't have to worry about being the person who exposed an S3 bucket full of, you know,
[26:38.300 --> 26:44.100]  20 million patients or whatever. Sahir probably has another view on this.
[26:44.560 --> 26:48.640]  No, I think Justin covered pretty much everything. Yeah, so it's more on, like,
[26:48.640 --> 26:53.420]  how are we going to give more flexibility to developers? Like, are we confident, like,
[26:53.420 --> 26:58.180]  making a segregated account where we can, like, let them do whatever they want and still have
[26:58.180 --> 27:03.320]  some sort of guardrails to watch for, some sort of guardrails to, like, detect things, along with
[27:03.320 --> 27:08.580]  SCPs and other preventative controls? Like, what if still things happen? Like, can we have, like,
[27:08.640 --> 27:13.440]  a framework which watches for, let's say, S3 bucket or AMIs and whatnot and fixes it automatically?
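[Editor's sketch] The "watch for an S3 bucket and fix it automatically" idea described above could look roughly like the following. This is a minimal illustration, not the speakers' framework: the four flag names are the fields of S3's PublicAccessBlockConfiguration, while `missing_public_access_blocks`, `remediation_plan`, and the shape of the returned plan are hypothetical names invented here. The actual AWS calls are left out so the logic stays self-contained.

```python
from typing import Dict, List, Optional

# The four flags of S3's PublicAccessBlockConfiguration.
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_public_access_blocks(config: Dict[str, bool]) -> List[str]:
    """Detective step: return the flags that are absent or set to False."""
    return [flag for flag in REQUIRED_FLAGS if not config.get(flag, False)]


def remediation_plan(bucket: str, config: Dict[str, bool]) -> Optional[dict]:
    """Responsive step: describe the fix, or return None if nothing is wrong.

    The plan format is hypothetical; a real remediation would pass
    plan["config"] to S3's PutPublicAccessBlock operation.
    """
    missing = missing_public_access_blocks(config)
    if not missing:
        return None
    return {
        "bucket": bucket,
        "action": "put_public_access_block",
        "config": {flag: True for flag in REQUIRED_FLAGS},
        "reason": "flags not enforced: " + ", ".join(missing),
    }
```

Keeping detection and remediation as separate functions makes it possible to run in detect-only mode first and switch remediation on later, which matches the "flipping switches" rollout described in the talk.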
[27:13.720 --> 27:17.760]  So yeah, so ensuring that level of maturity is what we're aiming for. And I'm pretty sure, like,
[27:17.760 --> 27:24.160]  software engineering kind of is on board on that. And then any kind of quick insight into
[27:24.940 --> 27:32.940]  what the cost savings were like from v1 to v2? Oh, so we haven't, like, compared,
[27:32.940 --> 27:38.620]  like, specifically the cost saving. We did, like, actually save a lot from the API calls
[27:38.620 --> 27:43.800]  perspective. Like, if I, if you look back at the slides which we had from the v1, we were making
[27:43.800 --> 27:51.260]  tons of API calls. Like, the most API calls spent was on snapshots, describing snapshots.
[27:51.260 --> 27:55.600]  You've got to describe a snapshot and then you describe each snapshot attribute. So it was,
[27:55.600 --> 28:01.520]  like, a very, very iterative. So we had, like, so many API calls which were impacting other stuff
[28:01.520 --> 28:05.480]  with, like, the whole situation where Ansible and stuff, because we were almost, like, getting
[28:05.480 --> 28:13.520]  pushed back on the EC2 resources, always hitting the limits. So v2 basically gave us some room,
[28:13.520 --> 28:18.240]  like, that we don't have to, like, describe and poll all the time. We can do it when certain
[28:18.240 --> 28:25.400]  events which we care about happen. So that was some of the more, like, API saving, the number
[28:25.400 --> 28:30.820]  of API call savings, but not specifically cost. That we haven't, like, compared yet.
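[Editor's sketch] To make the v1 versus v2 difference concrete, here is a rough illustration of why polling snapshots is expensive and what an event-driven check might look like instead. It is an assumption-laden sketch rather than the speakers' code: `polling_call_count` and `snapshot_made_public` are hypothetical helpers, the page size is an arbitrary placeholder, and the event is a simplified version of a CloudTrail `ModifySnapshotAttribute` record as delivered through EventBridge.

```python
from typing import Any, Dict, Optional


def polling_call_count(num_snapshots: int, page_size: int = 1000) -> int:
    """API calls for one polling sweep (v1 style).

    One paginated listing pass, plus one attribute lookup per snapshot.
    """
    list_calls = max(1, -(-num_snapshots // page_size))  # ceiling division
    return list_calls + num_snapshots


def snapshot_made_public(event: Dict[str, Any]) -> Optional[str]:
    """Event-driven check (v2 style): inspect one event, make zero describe calls.

    Returns the snapshot ID if the event shares the snapshot with the
    "all" group, otherwise None. The event shape is a simplified assumption.
    """
    detail = event.get("detail") or {}
    if detail.get("eventName") != "ModifySnapshotAttribute":
        return None
    params = detail.get("requestParameters") or {}
    added = (params.get("createVolumePermission") or {}).get("add") or {}
    for item in added.get("items", []):
        if item.get("group") == "all":
            return params.get("snapshotId")
    return None
```

With these placeholder numbers, an account holding 5,000 snapshots costs 5,005 calls per sweep under polling, while the event-driven path only does work when a relevant event arrives.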
[28:31.240 --> 28:37.180]  Yeah, the cost savings for me was our SREs quit calling, saying that API, you know, we were making
[28:37.180 --> 28:41.800]  too many API calls and people were getting rate limited, which is worth, like, a zillion dollars.
[28:41.800 --> 28:46.220]  Because, like, if I haven't made this clear, the one thing that I don't want to be doing
[28:46.220 --> 28:52.300]  is being on call and getting calls at night pisses me off to no end, especially when it's an angry SRE,
[28:52.300 --> 28:56.260]  because the SREs do not give a shit what I have to say. They are mad that things aren't working.
[28:56.260 --> 29:02.440]  So I... that stopped and that is worth just... it's invaluable. That's really all I care about is that they
[29:02.440 --> 29:08.980]  quit calling. I think we can all agree there. A couple more questions from the chat. Have you
[29:08.980 --> 29:15.180]  tested or compared this with AWS GuardDuty? So we do have... we do use, like, GuardDuty,
[29:15.180 --> 29:21.040]  but then GuardDuty, like, it doesn't capture everything that we want. Let's say, sure, it
[29:21.040 --> 29:25.040]  might have, like, for S3 checks and everything. But for, like, specific stuff, like, which we're
[29:25.040 --> 29:29.640]  looking for, like, IAM roles and whatnot, I'm not sure if it is being covered in GuardDuty, but we
[29:29.640 --> 29:36.880]  do complement that with, like, using GuardDuty, like, standalone. Yeah, to kind of... to add to that, you
[29:36.880 --> 29:42.220]  know, like we said before, I don't necessarily believe in replacing free things. If there's
[29:42.220 --> 29:47.420]  things that do things well, we will use them both as best we can and then try and, you know, write
[29:47.420 --> 29:52.440]  some code to glue it together. So, you know, we have a lot of... when you ask these questions about
[29:52.440 --> 29:56.600]  these things, very often the answer is we're using both of them as well as this.
[29:58.880 --> 30:05.740]  Got it. And another one, for some cases, why not create a preventative SCP?
[30:06.100 --> 30:12.680]  Sure, that's... yeah, this one, go ahead. Yeah, absolutely do that. You should do that. This is
[30:12.680 --> 30:19.240]  another layer helping us, you know, validate that the SCP hasn't changed or that I didn't fat finger
[30:19.240 --> 30:27.360]  the SCP or, honestly, I'm not that smart. I might have written a bad SCP and I kind of... I like
[30:27.360 --> 30:31.040]  having the many eyes approach and, like, the many, you know, the defense in depth approach of, you
[30:31.040 --> 30:39.140]  know, maybe I wrote a bad SCP and this is something we can use to detect... we can use to detect,
[30:39.140 --> 30:44.020]  you know, something going public again. So we have both. We're using SCPs. You should be using
[30:44.020 --> 30:49.380]  SCPs. They're a great preventative control. I also would say that having preventative controls
[30:49.380 --> 30:53.320]  does not mean you should avoid having detective and responsive controls as well. You know,
[30:53.320 --> 30:54.700]  you need multiple layers.
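[Editor's sketch] One way to "validate that the SCP hasn't changed", as mentioned above, is to compare a fingerprint of the deployed policy document against the version you expect. The following is a minimal sketch under that assumption; `policy_fingerprint` and `scp_has_drifted` are hypothetical names, and fetching the deployed policy from AWS Organizations is deliberately left out.

```python
import hashlib
import json
from typing import Any, Dict


def policy_fingerprint(policy: Dict[str, Any]) -> str:
    """Hash a policy document in a canonical form.

    Sorting keys and fixing separators makes the hash independent of key
    order and whitespace. List order is NOT normalized, so reordered
    statements count as a change.
    """
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def scp_has_drifted(expected: Dict[str, Any], deployed: Dict[str, Any]) -> bool:
    """Detective control: True if the deployed SCP differs from the expected one."""
    return policy_fingerprint(expected) != policy_fingerprint(deployed)
```

A drift check like this only tells you the SCP changed, not whether the SCP was correct in the first place, which is why the speakers pair it with detection of the misconfiguration itself.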
